import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
kakamana
January 21, 2023
Get more out of your XGBoost models with two end-to-end machine learning pipelines. You’ll learn how to tune the most important XGBoost hyperparameters efficiently within a pipeline, as well as how to use advanced preprocessing techniques.
This “Using XGBoost in pipelines” post is part of the DataCamp course: Extreme Gradient Boosting with XGBoost
This is my learning experience of data science through DataCamp
Now that you’ve seen what needs to be done to get the housing data ready for XGBoost, let’s go through the process step by step.
First, you will need to fill in missing values - as you saw previously, the column LotFrontage has many missing values. Then, you will need to encode any categorical columns in the dataset using one-hot encoding so that they are encoded numerically.
The data has five categorical columns: MSZoning, PavedDrive, Neighborhood, BldgType, and HouseStyle. Scikit-learn provides a LabelEncoder that converts the values in each categorical column into integers. You’ll practice using it here.
  | MSSubClass | MSZoning | LotFrontage | LotArea | Neighborhood | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | ... | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | Fireplaces | GarageArea | PavedDrive | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60 | RL | 65.0 | 8450 | CollgCr | 1Fam | 2Story | 7 | 5 | 2003 | ... | 1710 | 1 | 0 | 2 | 1 | 3 | 0 | 548 | Y | 208500 |
1 | 20 | RL | 80.0 | 9600 | Veenker | 1Fam | 1Story | 6 | 8 | 1976 | ... | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | 460 | Y | 181500 |
2 | 60 | RL | 68.0 | 11250 | CollgCr | 1Fam | 2Story | 7 | 5 | 2001 | ... | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | 608 | Y | 223500 |
3 | 70 | RL | 60.0 | 9550 | Crawfor | 1Fam | 2Story | 7 | 5 | 1915 | ... | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | 642 | Y | 140000 |
4 | 60 | RL | 84.0 | 14260 | NoRidge | 1Fam | 2Story | 8 | 5 | 2000 | ... | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | 836 | Y | 250000 |
5 rows × 21 columns
from sklearn.preprocessing import LabelEncoder
# Load the Ames housing data
df = pd.read_csv('dataset/ames_unprocessed_data.csv')
# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)
# Create a boolean mask for categorical columns
categorical_mask = (df.dtypes == 'object')
# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()
# Print the head of the categorical columns
print(df[categorical_columns].head())
# Create LabelEncoder object: le
le = LabelEncoder()
# Apply LabelEncoder to each categorical column
df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))
# Print the head of the LabelEncoded categorical columns
print(df[categorical_columns].head())
print("\nNotice how the entries in each categorical column are now encoded numerically. A BldgTpe of 1Fam is encoded as 0, while a HouseStyle of 2Story is encoded as 5")
MSZoning Neighborhood BldgType HouseStyle PavedDrive
0 RL CollgCr 1Fam 2Story Y
1 RL Veenker 1Fam 1Story Y
2 RL CollgCr 1Fam 2Story Y
3 RL Crawfor 1Fam 2Story Y
4 RL NoRidge 1Fam 2Story Y
MSZoning Neighborhood BldgType HouseStyle PavedDrive
0 3 5 0 5 2
1 3 24 0 2 2
2 3 5 0 5 2
3 3 6 0 5 2
4 3 15 0 5 2
Notice how the entries in each categorical column are now encoded numerically. A BldgType of 1Fam is encoded as 0, while a HouseStyle of 2Story is encoded as 5.
Okay - so you have your categorical columns encoded numerically. Can you now move on to using pipelines and XGBoost? Not yet! In the categorical columns of this dataset, there is no natural ordering between the entries. For example, using LabelEncoder, the CollgCr Neighborhood was encoded as 5, the Veenker Neighborhood as 24, and Crawfor as 6. Is Veenker “greater” than Crawfor and CollgCr? No - and allowing the model to assume this natural ordering may result in poor performance.
As a result, there is another step needed: you have to apply one-hot encoding to create binary, or “dummy”, variables. You can do this using scikit-learn’s OneHotEncoder.
Warning: Instead of using LabelEncoder, use make_column_transformer with OneHotEncoder.
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
df = pd.read_csv('dataset/ames_unprocessed_data.csv')
# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)
# Create a boolean mask for categorical columns
categorical_mask = (df.dtypes == 'object')
# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()
# Generate the list of unique categories for each categorical column
unique_list = [df[c].unique().tolist() for c in categorical_columns]
# Create OneHotEncoder: ohe
ohe = OneHotEncoder(categories=unique_list)
# Create a column transformer that one-hot encodes the categorical columns and passes the rest through
preprocess = make_column_transformer(
    (ohe, categorical_columns),
    ('passthrough', categorical_mask[~categorical_mask].index.tolist())
)
# apply OneHotEncoder to categorical columns - output is no longer a dataframe: df_encoded
df_encoded = preprocess.fit_transform(df)
# Print first 5 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print(df_encoded[:5, :])
# Print the shape of the original DataFrame
print(df.shape)
# Print the shape of the transformed array
print(df_encoded.shape)
[[1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 1.000e+00 0.000e+00 0.000e+00 6.000e+01 6.500e+01 8.450e+03
7.000e+00 5.000e+00 2.003e+03 0.000e+00 1.710e+03 1.000e+00 0.000e+00
2.000e+00 1.000e+00 3.000e+00 0.000e+00 5.480e+02 2.085e+05]
[1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 1.000e+00 0.000e+00 0.000e+00 2.000e+01 8.000e+01 9.600e+03
6.000e+00 8.000e+00 1.976e+03 0.000e+00 1.262e+03 0.000e+00 1.000e+00
2.000e+00 0.000e+00 3.000e+00 1.000e+00 4.600e+02 1.815e+05]
[1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 1.000e+00 0.000e+00 0.000e+00 6.000e+01 6.800e+01 1.125e+04
7.000e+00 5.000e+00 2.001e+03 1.000e+00 1.786e+03 1.000e+00 0.000e+00
2.000e+00 1.000e+00 3.000e+00 1.000e+00 6.080e+02 2.235e+05]
[1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 1.000e+00 0.000e+00 0.000e+00 7.000e+01 6.000e+01 9.550e+03
7.000e+00 5.000e+00 1.915e+03 1.000e+00 1.717e+03 1.000e+00 0.000e+00
1.000e+00 0.000e+00 3.000e+00 1.000e+00 6.420e+02 1.400e+05]
[1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 1.000e+00 0.000e+00 0.000e+00 6.000e+01 8.400e+01 1.426e+04
8.000e+00 5.000e+00 2.000e+03 0.000e+00 2.198e+03 1.000e+00 0.000e+00
2.000e+00 1.000e+00 4.000e+00 1.000e+00 8.360e+02 2.500e+05]]
(1460, 21)
(1460, 62)
Alright, one final trick before you dive into pipelines. The two-step process you just went through - LabelEncoder followed by OneHotEncoder - can be simplified by using a DictVectorizer.
Using a DictVectorizer on a DataFrame that has been converted to a dictionary allows you to get label encoding as well as one-hot encoding in one go.
Your task is to work through this strategy in this exercise!
from sklearn.feature_extraction import DictVectorizer
# Convert df into a dictionary: df_dict
df_dict = df.to_dict("records")
# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse=False)
# Apply dv to the dictionary of records: df_encoded2
df_encoded2 = dv.fit_transform(df_dict)
# Print the resulting first five rows
print(df_encoded2[:5, :])
# Print the vocabulary
print(dv.vocabulary_)
print("\nBesides simplifying the process into one step, DictVectorizer has useful attributes such as vocabulary_ which maps the names of the features to their indices. ")
[[3.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
0.000e+00 0.000e+00 2.000e+00 5.480e+02 1.710e+03 1.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00
8.450e+03 6.500e+01 6.000e+01 0.000e+00 0.000e+00 0.000e+00 1.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 5.000e+00 7.000e+00
0.000e+00 0.000e+00 1.000e+00 0.000e+00 2.085e+05 2.003e+03]
[3.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
1.000e+00 1.000e+00 2.000e+00 4.600e+02 1.262e+03 0.000e+00 0.000e+00
0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
9.600e+03 8.000e+01 2.000e+01 0.000e+00 0.000e+00 0.000e+00 1.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 8.000e+00 6.000e+00
0.000e+00 0.000e+00 1.000e+00 0.000e+00 1.815e+05 1.976e+03]
[3.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
0.000e+00 1.000e+00 2.000e+00 6.080e+02 1.786e+03 1.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00
1.125e+04 6.800e+01 6.000e+01 0.000e+00 0.000e+00 0.000e+00 1.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 5.000e+00 7.000e+00
0.000e+00 0.000e+00 1.000e+00 1.000e+00 2.235e+05 2.001e+03]
[3.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
0.000e+00 1.000e+00 1.000e+00 6.420e+02 1.717e+03 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00
9.550e+03 6.000e+01 7.000e+01 0.000e+00 0.000e+00 0.000e+00 1.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 5.000e+00 7.000e+00
0.000e+00 0.000e+00 1.000e+00 1.000e+00 1.400e+05 1.915e+03]
[4.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
0.000e+00 1.000e+00 2.000e+00 8.360e+02 2.198e+03 1.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00
1.426e+04 8.400e+01 6.000e+01 0.000e+00 0.000e+00 0.000e+00 1.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 5.000e+00 8.000e+00
0.000e+00 0.000e+00 1.000e+00 0.000e+00 2.500e+05 2.000e+03]]
{'MSSubClass': 23, 'MSZoning=RL': 27, 'LotFrontage': 22, 'LotArea': 21, 'Neighborhood=CollgCr': 34, 'BldgType=1Fam': 1, 'HouseStyle=2Story': 18, 'OverallQual': 55, 'OverallCond': 54, 'YearBuilt': 61, 'Remodeled': 59, 'GrLivArea': 11, 'BsmtFullBath': 6, 'BsmtHalfBath': 7, 'FullBath': 9, 'HalfBath': 12, 'BedroomAbvGr': 0, 'Fireplaces': 8, 'GarageArea': 10, 'PavedDrive=Y': 58, 'SalePrice': 60, 'Neighborhood=Veenker': 53, 'HouseStyle=1Story': 15, 'Neighborhood=Crawfor': 35, 'Neighborhood=NoRidge': 44, 'Neighborhood=Mitchel': 40, 'HouseStyle=1.5Fin': 13, 'Neighborhood=Somerst': 50, 'Neighborhood=NWAmes': 43, 'MSZoning=RM': 28, 'Neighborhood=OldTown': 46, 'Neighborhood=BrkSide': 32, 'BldgType=2fmCon': 2, 'HouseStyle=1.5Unf': 14, 'Neighborhood=Sawyer': 48, 'Neighborhood=NridgHt': 45, 'Neighborhood=NAmes': 41, 'BldgType=Duplex': 3, 'Neighborhood=SawyerW': 49, 'Neighborhood=IDOTRR': 38, 'PavedDrive=N': 56, 'Neighborhood=MeadowV': 39, 'BldgType=TwnhsE': 5, 'MSZoning=C (all)': 24, 'Neighborhood=Edwards': 36, 'Neighborhood=Timber': 52, 'PavedDrive=P': 57, 'HouseStyle=SFoyer': 19, 'MSZoning=FV': 25, 'Neighborhood=Gilbert': 37, 'HouseStyle=SLvl': 20, 'BldgType=Twnhs': 4, 'Neighborhood=StoneBr': 51, 'HouseStyle=2.5Unf': 17, 'Neighborhood=ClearCr': 33, 'Neighborhood=NPkVill': 42, 'HouseStyle=2.5Fin': 16, 'Neighborhood=Blmngtn': 29, 'Neighborhood=BrDale': 31, 'Neighborhood=SWISU': 47, 'MSZoning=RH': 26, 'Neighborhood=Blueste': 30}
Besides simplifying the process into one step, DictVectorizer has useful attributes such as vocabulary_ which maps the names of the features to their indices.
Now that you’ve seen what steps need to be taken individually to properly process the Ames housing data, let’s use the much cleaner and more succinct DictVectorizer approach and put it alongside an XGBRegressor inside of a scikit-learn pipeline.
from sklearn.pipeline import Pipeline
# Split the data into features X and target y (SalePrice is the last column)
X, y = df.iloc[:, :-1], df.iloc[:, -1]
# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)
# Setup the pipeline steps: steps
steps = [('ohe_onestep', DictVectorizer(sparse=False)),
('xgb_model', xgb.XGBRegressor())]
# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)
# Fit the pipeline
xgb_pipeline.fit(X.to_dict("records"), y)
Pipeline(steps=[('ohe_onestep', DictVectorizer(sparse=False)), ('xgb_model', XGBRegressor(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, ...))])
Before going further, here are some additional components that are useful when building preprocessing pipelines:

- sklearn_pandas:
  - DataFrameMapper - interoperability between pandas and scikit-learn
  - CategoricalImputer - allows imputation of categorical variables before conversion to integers
- sklearn.preprocessing:
  - Imputer - native imputation of numerical columns in scikit-learn (replaced by SimpleImputer in recent versions)
- sklearn.pipeline:
  - FeatureUnion - combines multiple pipelines of features into a single pipeline of features

In this exercise, you’ll go one step further by using the pipeline you’ve created to preprocess and cross-validate your model.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)
# Setup the pipeline steps: steps
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
("xgb_model", xgb.XGBRegressor(max_depth=2, objective='reg:squarederror'))]
# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)
# Cross-validate the model
cross_val_scores = cross_val_score(xgb_pipeline, X.to_dict('records'), y,
scoring='neg_mean_squared_error', cv=10)
# Print the 10-fold RMSE
print("10-fold RMSE: ", np.mean(np.sqrt(np.abs(cross_val_scores))))
10-fold RMSE: 27683.04157118635
You’ll now continue your exploration of using pipelines with a dataset that requires significantly more wrangling. The chronic kidney disease dataset contains both categorical and numeric features, but it contains lots of missing values. The goal here is to predict who has chronic kidney disease given various blood indicators as features.
As Sergey mentioned in the video, you’ll be introduced to a new library, sklearn_pandas, that allows you to chain many more processing steps inside of a pipeline than are currently supported in scikit-learn. Specifically, you’ll be able to impute missing categorical values directly using the CategoricalImputer() class in sklearn_pandas, and use the DataFrameMapper() class to apply any arbitrary sklearn-compatible transformer on DataFrame columns, where the resulting output can be either a NumPy array or a DataFrame.
We’ve also created a transformer called a Dictifier that encapsulates converting a DataFrame using .to_dict("records") without you having to do it explicitly (and so that it works in a pipeline). Finally, we’ve also provided the list of feature names in kidney_feature_names, the target name in kidney_target_name, the features in X, and the target in y.
In this exercise, your task is to apply the CategoricalImputer to impute all of the categorical columns in the dataset. You can refer to how the numeric imputation mapper was created as a template. Notice the keyword arguments input_df=True and df_out=True? These are there so that you can work with DataFrames instead of arrays. By default, the transformers are passed a NumPy array of the selected columns as input, and as a result, the output of the DataFrameMapper is also an array. Scikit-learn transformers have historically been designed to work with NumPy arrays, not pandas DataFrames, even though their basic indexing interfaces are similar.
Requirement already satisfied: sklearn_pandas in c:\users\dghr201\appdata\local\programs\python\python39\lib\site-packages (2.2.0)
Requirement already satisfied: scikit-learn>=0.23.0 in c:\users\dghr201\appdata\local\programs\python\python39\lib\site-packages (from sklearn_pandas) (1.1.2)
Requirement already satisfied: numpy>=1.18.1 in c:\users\dghr201\appdata\roaming\python\python39\site-packages (from sklearn_pandas) (1.23.2)
Requirement already satisfied: pandas>=1.1.4 in c:\users\dghr201\appdata\local\programs\python\python39\lib\site-packages (from sklearn_pandas) (1.4.2)
Requirement already satisfied: scipy>=1.5.1 in c:\users\dghr201\appdata\roaming\python\python39\site-packages (from sklearn_pandas) (1.9.1)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\dghr201\appdata\local\programs\python\python39\lib\site-packages (from pandas>=1.1.4->sklearn_pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\dghr201\appdata\local\programs\python\python39\lib\site-packages (from pandas>=1.1.4->sklearn_pandas) (2022.1)
Requirement already satisfied: six>=1.5 in c:\users\dghr201\appdata\local\programs\python\python39\lib\site-packages (from python-dateutil>=2.8.1->pandas>=1.1.4->sklearn_pandas) (1.16.0)
Requirement already satisfied: joblib>=1.0.0 in c:\users\dghr201\appdata\local\programs\python\python39\lib\site-packages (from scikit-learn>=0.23.0->sklearn_pandas) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\dghr201\appdata\local\programs\python\python39\lib\site-packages (from scikit-learn>=0.23.0->sklearn_pandas) (3.1.0)
from sklearn_pandas import DataFrameMapper, CategoricalImputer
from sklearn.impute import SimpleImputer
# Check the number of nulls in each feature column
nulls_per_column = X.isnull().sum()
print(nulls_per_column)
# Create a boolean mask for categorical columns
categorical_feature_mask = X.dtypes == object
# Get list of categorical column names
categorical_columns = X.columns[categorical_feature_mask].tolist()
# Get list of non-categorical column names
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()
# Create the numeric imputation mapper
numeric_imputation_mapper = DataFrameMapper(
    [([numeric_feature], SimpleImputer(strategy='median'))
     for numeric_feature in non_categorical_columns],
    input_df=True,
    df_out=True
)

# Create the categorical imputation mapper
categorical_imputation_mapper = DataFrameMapper(
    [(category_feature, CategoricalImputer())
     for category_feature in categorical_columns],
    input_df=True,
    df_out=True
)
ImportError: cannot import name 'CategoricalImputer' from 'sklearn_pandas' (C:\Users\dghr201\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn_pandas\__init__.py)
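Note: recent releases of sklearn_pandas (2.x) no longer ship CategoricalImputer, which is why the import above fails. A minimal workaround sketch, assuming scikit-learn's SimpleImputer with strategy='most_frequent' as a stand-in for categorical imputation:

from sklearn.impute import SimpleImputer
from sklearn_pandas import DataFrameMapper
# Impute each categorical column with its most frequent value instead of CategoricalImputer
categorical_imputation_mapper = DataFrameMapper(
    [([category_feature], SimpleImputer(strategy='most_frequent'))
     for category_feature in categorical_columns],
    input_df=True,
    df_out=True
)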
Having separately imputed numeric as well as categorical columns, your task is now to use scikit-learn’s FeatureUnion to concatenate their results, which are contained in two separate transformer objects: numeric_imputation_mapper and categorical_imputation_mapper, respectively.
Just like with pipelines, you have to pass it a list of (string, transformer) tuples, where the first half of each tuple is the name of the transformer. A sketch of this union step is shown below.
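The union itself is not shown in the original cells, so here is a minimal sketch of how the numeric_categorical_union used in the pipeline below could be assembled (the transformer names "num_mapper" and "cat_mapper" are illustrative):

from sklearn.pipeline import FeatureUnion
# Combine the numeric and categorical imputation mappers into one transformer
numeric_categorical_union = FeatureUnion([
    ("num_mapper", numeric_imputation_mapper),
    ("cat_mapper", categorical_imputation_mapper)
])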
It’s time to piece together all of the transforms along with an XGBClassifier to build the full pipeline!
Besides the numeric_categorical_union that you created in the previous exercise, there are two other transforms needed: the Dictifier() transform which we created for you, and the DictVectorizer().
After creating the pipeline, your task is to cross-validate it to see how well it performs.
from sklearn.base import BaseEstimator, TransformerMixin
# Define Dictifier class to turn df into dictionary as part of pipeline
class Dictifier(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if type(X) == pd.core.frame.DataFrame:
            return X.to_dict("records")
        else:
            return pd.DataFrame(X).to_dict("records")
# Create full pipeline
pipeline = Pipeline([
    ("featureunion", numeric_categorical_union),
    ("dictifier", Dictifier()),
    ("vectorizer", DictVectorizer(sort=False)),
    ("clf", xgb.XGBClassifier(max_depth=3))
])
# Perform cross-validation
cross_val_scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=3)
# Print avg. AUC
print("3-fold AUC: ", np.mean(cross_val_scores))
Alright, it’s time to bring together everything you’ve learned so far! In this final exercise of the course, you will combine your work from the previous exercises into one end-to-end XGBoost pipeline to really cement your understanding of preprocessing and pipelines in XGBoost.
Your work from the previous 3 exercises, where you preprocessed the data and set up your pipeline, has been pre-loaded. Your job is to perform a randomized search and identify the best hyperparameters.
from sklearn.model_selection import RandomizedSearchCV
# Create the parameter grid
gbm_param_grid = {
    'clf__learning_rate': np.arange(0.05, 1, 0.05),
    'clf__max_depth': np.arange(3, 10, 1),
    'clf__n_estimators': np.arange(50, 200, 50)
}
# Perform RandomizedSearchCV
randomized_roc_auc = RandomizedSearchCV(estimator=pipeline, param_distributions=gbm_param_grid,
                                        n_iter=2, scoring='roc_auc', cv=2, verbose=1)
# Fit the estimator
randomized_roc_auc.fit(X, y)
# Compute metrics
print('Score: ', randomized_roc_auc.best_score_)
print('Estimator: ', randomized_roc_auc.best_estimator_)
NameError: name 'pipeline' is not defined

This NameError occurs because the pipeline was never created: the earlier cell failed on the CategoricalImputer import, so nothing downstream ran. Once the categorical imputation step is fixed (for example, with the SimpleImputer workaround sketched above) and the FeatureUnion is in place, the full pipeline can be built and this randomized search runs end to end.